# Multimodal Fusion
## VideoLLaMA2.1-7B-AV-CoT
**License:** Apache-2.0 · **Tags:** Video-to-Text, Transformers, English · **Author:** lym0302 · **Downloads:** 34 · **Likes:** 0

VideoLLaMA2.1-7B-AV is a multimodal large language model for audio-visual question answering. It processes video and audio inputs jointly to produce high-quality answers and descriptions.

## HunyuanVideo-I2V
**License:** Other · **Author:** tencent · **Downloads:** 3,272 · **Likes:** 305

HunyuanVideo-I2V is an image-to-video generation framework extended from Tencent's HunyuanVideo model, supporting high-quality video generation from static images.

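For orientation, here is a minimal image-to-video sketch via diffusers. The pipeline class `HunyuanVideoImageToVideoPipeline` and the repo id `hunyuanvideo-community/HunyuanVideo-I2V` are assumptions based on the community diffusers port; consult the model card for the exact id and recommended dtype/offload settings.

```python
# Sketch: image-to-video with HunyuanVideo-I2V via diffusers.
# Assumption: the community port "hunyuanvideo-community/HunyuanVideo-I2V";
# verify the repo id and recommended dtype on the model card.
import torch
from diffusers import HunyuanVideoImageToVideoPipeline
from diffusers.utils import load_image, export_to_video

pipe = HunyuanVideoImageToVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo-I2V", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # trades speed for much lower VRAM use

image = load_image("input.png")
frames = pipe(
    image=image,
    prompt="a serene landscape, the camera slowly pans right",
    num_frames=49,
).frames[0]
export_to_video(frames, "output.mp4", fps=15)
```
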
## ViT-BART Image Captioner
**License:** Apache-2.0 · **Tags:** Image-to-Text, English · **Author:** SrujanTopalle · **Downloads:** 15 · **Likes:** 1

A vision-language model that pairs a ViT image encoder with a BART-Large decoder to generate English descriptions of images.

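Pairing a ViT encoder with a BART decoder is the standard `VisionEncoderDecoderModel` pattern in transformers, so a captioning call would plausibly look like the sketch below; the repo id is hypothetical and should be replaced with the actual checkpoint.

```python
# Sketch of the ViT-encoder + BART-decoder captioning pattern.
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer

model_id = "SrujanTopalle/vit-bart-image-captioner"  # hypothetical id
model = VisionEncoderDecoderModel.from_pretrained(model_id)
image_processor = ViTImageProcessor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

image = Image.open("photo.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values

# ViT encodes the image; BART decodes the caption autoregressively.
output_ids = model.generate(pixel_values, max_new_tokens=30, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
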
## SD3.5-Large IP-Adapter
**License:** Other · **Tags:** Text-to-Image, English · **Author:** InstantX · **Downloads:** 1,474 · **Likes:** 106

An IP-Adapter for the SD3.5-Large model that conditions generation on a reference image alongside the text prompt to produce new images.

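Recent diffusers releases add IP-Adapter loading to `StableDiffusion3Pipeline`; the sketch below assumes that support plus the hub ids shown, so treat it as a starting point and defer to the adapter's model card for the blessed loading code.

```python
# Sketch: SD3.5-Large with an IP-Adapter reference image.
# Assumptions: diffusers with SD3 IP-Adapter support; hub ids as shown.
import torch
from diffusers import StableDiffusion3Pipeline
from diffusers.utils import load_image

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_ip_adapter("InstantX/SD3.5-Large-IP-Adapter")
pipe.set_ip_adapter_scale(0.6)  # 0 = ignore the image, 1 = follow it closely

reference = load_image("reference.jpg")
image = pipe(
    prompt="a cat wearing a spacesuit, studio lighting",
    ip_adapter_image=reference,
).images[0]
image.save("out.png")
```
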
## SDXL IP-Adapter
**License:** Apache-2.0 · **Tags:** Text-to-Image, Other · **Author:** refiners · **Downloads:** 18 · **Likes:** 0

IP-Adapter is an image prompt adapter for text-to-image diffusion models that combines image prompts with text prompts to improve the relevance and quality of generated images.

## AA-Chameleon-7B-Base
**Tags:** Text-to-Image, Transformers, English · **Author:** PKU-Alignment · **Downloads:** 105 · **Likes:** 8

A multimodal model supporting interleaved text-image input and output, based on the Chameleon 7B model with image generation capabilities enhanced through the Align-Anything framework.

## LinFusion XL
**Tags:** Text-to-Image · **Author:** Yuanshi · **Downloads:** 37 · **Likes:** 7

LinFusion is a diffusion-based text-to-image generation model that produces high-quality images from textual descriptions.

## AV-HuBERT
**Tags:** Audio-to-Text, Transformers · **Author:** nguyenvulebinh · **Downloads:** 683 · **Likes:** 3

A multilingual audio-visual speech recognition model trained on the MuAViC dataset, combining audio and visual modalities for robust recognition.

## ChatTime-1-7B-Base
**License:** Apache-2.0 · **Tags:** Multimodal Fusion, Transformers · **Author:** ChengsenWang · **Downloads:** 700 · **Likes:** 4

ChatTime is a multimodal time-series foundation model that treats time series as a foreign language, unifying bimodal input and output over both time series and text.

## ConsistentID
**License:** MIT · **Tags:** Text-to-Image, Other · **Author:** JackAILab · **Downloads:** 176 · **Likes:** 8

ConsistentID is a multimodal fine-grained identity-preserving portrait generation model that produces portraits with very high identity fidelity while maintaining diversity and text controllability.

## Music Generation Model
**License:** Apache-2.0 · **Tags:** Text-to-Audio, Transformers · **Author:** nagayama0706 · **Downloads:** 27 · **Likes:** 1

A hybrid model created by merging a text generation model with a music generation model, capable of handling both text and music generation tasks.

## InstructBLIP Flan-T5-XXL 8-bit
**License:** MIT · **Tags:** Image-to-Text, Transformers, English · **Author:** Mediocreatmybest · **Downloads:** 18 · **Likes:** 1

An 8-bit quantized build of InstructBLIP with a Flan-T5-XXL language model: a vision-language model pretrained with the image encoder and large language model frozen, supporting image caption generation and visual question answering.

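InstructBLIP has first-class transformers support (`InstructBlipProcessor`, `InstructBlipForConditionalGeneration`), and 8-bit loading goes through bitsandbytes. A minimal VQA sketch follows; the repo id of this particular 8-bit build is an assumption.

```python
from PIL import Image
from transformers import (
    BitsAndBytesConfig,
    InstructBlipForConditionalGeneration,
    InstructBlipProcessor,
)

model_id = "Mediocreatmybest/instructblip-flan-t5-xxl_8bit"  # assumed hub id
processor = InstructBlipProcessor.from_pretrained(model_id)
model = InstructBlipForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # needs bitsandbytes
    device_map="auto",
)

image = Image.open("photo.jpg").convert("RGB")
inputs = processor(
    images=image, text="What is happening in this image?", return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```
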
## YOLO LLaMa 7B VisNav
**License:** Other · **Tags:** Multimodal Fusion, Transformers · **Author:** LearnItAnyway · **Downloads:** 19 · **Likes:** 1

Integrates the YOLO object detection model with the Llama 2 7B large language model to provide navigation assistance for visually impaired users in daily travel.

## TimeSformer-BERT Video Captioning
**Tags:** Video-to-Text, Transformers · **Author:** AlexZigma · **Downloads:** 83 · **Likes:** 3

A video caption generation model based on the TimeSformer and BERT architectures, capable of generating descriptive captions for video content.

## BLIP-2 Flan-T5-XXL
**License:** MIT · **Tags:** Image-to-Text, Transformers, English · **Author:** LanguageMachines · **Downloads:** 22 · **Likes:** 1

BLIP-2 is a vision-language model that combines an image encoder with a large language model for image-to-text tasks.

## FuseCap Image Captioning
**License:** MIT · **Tags:** Image-to-Text, Transformers · **Author:** noamrot · **Downloads:** 2,771 · **Likes:** 22

FuseCap is a framework for generating semantically rich image captions, leveraging large language models to produce fused image descriptions.

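FuseCap's released captioner appears to follow the plain BLIP interface; both the repo id and the BLIP-style classes in the sketch below are assumptions to verify against the model card.

```python
# Sketch: BLIP-style captioning, assuming FuseCap ships as a BLIP checkpoint.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

model_id = "noamrot/FuseCap_Image_Captioning"  # assumed hub id
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("photo.jpg").convert("RGB")
# A short text prefix is a common BLIP conditioning trick; "a picture of"
# here is an assumption, not a documented trigger.
inputs = processor(images=image, text="a picture of", return_tensors="pt")
out = model.generate(**inputs, num_beams=3, max_new_tokens=40)
print(processor.decode(out[0], skip_special_tokens=True))
```
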
## Raos Virtual Try-On Model
**License:** OpenRAIL · **Tags:** Image Generation · **Author:** gouthaml · **Downloads:** 258 · **Likes:** 41

A virtual try-on system built on the Stable Diffusion framework, integrating DreamBooth training, EfficientNet-B3 feature extraction, and OpenPose pose detection.

## BBS-Net
**License:** MIT · **Tags:** Image Segmentation, Transformers · **Author:** RGBD-SOD · **Downloads:** 21 · **Likes:** 3

BBS-Net is a deep learning model for RGB-D salient object detection, employing a bifurcated backbone strategy network that effectively processes RGB and depth image data.

## BLIP-2 Flan-T5-XXL
**License:** MIT · **Tags:** Image-to-Text, Transformers, English · **Author:** Salesforce · **Downloads:** 6,419 · **Likes:** 88

BLIP-2 is a vision-language model that combines an image encoder with the Flan-T5-XXL large language model for image-to-text tasks.

## BLIP-2 OPT-2.7B COCO
**License:** MIT · **Tags:** Image-to-Text, Transformers, English · **Author:** Salesforce · **Downloads:** 3,900 · **Likes:** 9

BLIP-2 is a vision-language pretrained model that bootstraps language-image pretraining from a frozen image encoder and a frozen large language model.

## BLIP-2 OPT-6.7B
**License:** MIT · **Tags:** Image-to-Text, Transformers, English · **Author:** Salesforce · **Downloads:** 5,871 · **Likes:** 76

BLIP-2 is a vision-language model based on OPT-6.7B, pretrained with the image encoder and large language model frozen, supporting tasks such as image-to-text generation and visual question answering.

## BLIP-2 Flan-T5-XL
**License:** MIT · **Tags:** Image-to-Text, Transformers, English · **Author:** Salesforce · **Downloads:** 91.77k · **Likes:** 68

BLIP-2 is a vision-language model based on Flan-T5-XL, pretrained with the image encoder and large language model frozen, supporting tasks such as image captioning and visual question answering.

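The four Salesforce BLIP-2 checkpoints above share one transformers interface, so a single sketch covers both captioning and VQA for all of them (swap the `model_id`):

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Any of the listed checkpoints works here, e.g. "Salesforce/blip2-flan-t5-xxl",
# "Salesforce/blip2-opt-2.7b-coco", "Salesforce/blip2-opt-6.7b".
model_id = "Salesforce/blip2-flan-t5-xl"
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("photo.jpg").convert("RGB")

# Captioning: image only, no text prompt.
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
print(processor.decode(model.generate(**inputs)[0], skip_special_tokens=True))

# VQA: prepend a question in the "Question: ... Answer:" format.
inputs = processor(
    images=image,
    text="Question: how many people are there? Answer:",
    return_tensors="pt",
).to(model.device, torch.float16)
print(processor.decode(model.generate(**inputs)[0], skip_special_tokens=True))
```
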
## Wavyfusion
**License:** OpenRAIL · **Tags:** Image Generation, English · **Author:** wavymulder · **Downloads:** 454 · **Likes:** 170

A Stable Diffusion-based text-to-image model for creative image generation.

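As a regular Stable Diffusion checkpoint, Wavyfusion should load with the stock diffusers pipeline; the repo id and the style trigger token in the prompt below are assumptions to check against the model card.

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed hub id; verify on the model card.
pipe = StableDiffusionPipeline.from_pretrained(
    "wavymulder/wavyfusion", torch_dtype=torch.float16
).to("cuda")

# Style trigger token assumed to be "wa-vy style"; verify on the model card.
image = pipe(
    "wa-vy style, portrait of a fox in a misty forest",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("wavyfusion.png")
```
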
## Wav2Vec2-2-BART-Large
**Tags:** Speech Recognition, Transformers · **Author:** patrickvonplaten · **Downloads:** 31 · **Likes:** 5

An automatic speech recognition (ASR) model that couples a wav2vec2-large-lv60 encoder with a bart-large decoder, fine-tuned on the clean subset of the librispeech_asr dataset.

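This is the warm-started `SpeechEncoderDecoderModel` pattern (wav2vec2 encoder, BART decoder). The sketch assumes the hub id `patrickvonplaten/wav2vec2-2-bart-large` and the usual feature-extractor/tokenizer pairing:

```python
from datasets import load_dataset
from transformers import AutoFeatureExtractor, AutoTokenizer, SpeechEncoderDecoderModel

model_id = "patrickvonplaten/wav2vec2-2-bart-large"  # assumed hub id
model = SpeechEncoderDecoderModel.from_pretrained(model_id)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The wav2vec2 encoder expects 16 kHz mono audio.
ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
inputs = feature_extractor(
    ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt"
)
generated_ids = model.generate(inputs.input_values)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
```
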